Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate SparseDataFrame and SparseSeries #26137

Merged
merged 29 commits into from
May 29, 2019

Conversation

TomAugspurger
Copy link
Contributor

Closes #19239

This currently includes the changes from #25682, which I think is mergeable.

I think this would be good to have for 0.25.0. I think it's close, but I may not have time to push this across the finish line. Anyone interested in finishing it off?

commit 8b136bf
Merge: 3005aed 01d3dc2
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Fri Mar 15 16:03:23 2019 -0500

    Merge remote-tracking branch 'upstream/master' into sparse-frame-accessor

commit 3005aed
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Mar 14 06:26:32 2019 -0500

    isort?

commit 318c06f
Merge: 0922296 79205ea
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Mar 14 06:25:45 2019 -0500

    Merge remote-tracking branch 'upstream/master' into sparse-frame-accessor

commit 0922296
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Mar 13 21:35:51 2019 -0500

    updates

commit f433be8
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Mar 13 20:54:07 2019 -0500

    lint

commit 6696f28
Merge: 534a379 1017382
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Mar 13 20:53:13 2019 -0500

    Merge remote-tracking branch 'upstream/master' into sparse-frame-accessor

commit 534a379
Merge: 94a7baf 5c341dc
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Tue Mar 12 14:37:27 2019 -0500

    Merge remote-tracking branch 'upstream/master' into sparse-frame-accessor

commit 94a7baf
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Tue Mar 12 14:22:48 2019 -0500

    fixups

commit 6f619b5
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Tue Mar 12 13:38:48 2019 -0500

    32-bit compat

commit 24f48c3
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Mon Mar 11 22:05:46 2019 -0500

    API: DataFrame.sparse accessor

    Closes pandas-dev#25681
pandas/util/testing.py Outdated Show resolved Hide resolved
pandas/core/sparse/series.py Outdated Show resolved Hide resolved
@gfyoung gfyoung added Deprecate Functionality to remove in pandas Sparse Sparse Data Type labels Apr 18, 2019
@codecov
Copy link

codecov bot commented May 14, 2019

Codecov Report

Merging #26137 into master will decrease coverage by 0.57%.
The diff coverage is 30.43%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26137      +/-   ##
==========================================
- Coverage   41.31%   40.73%   -0.58%     
==========================================
  Files         174      175       +1     
  Lines       50749    52432    +1683     
==========================================
+ Hits        20968    21360     +392     
- Misses      29781    31072    +1291
Flag Coverage Δ
#single 40.73% <30.43%> (-0.58%) ⬇️
Impacted Files Coverage Δ
pandas/util/testing.py 49.31% <100%> (+0.1%) ⬆️
pandas/core/sparse/series.py 44.24% <100%> (+0.49%) ⬆️
pandas/core/frame.py 34.57% <100%> (-0.12%) ⬇️
pandas/core/arrays/sparse.py 38.9% <21.05%> (+0.01%) ⬆️
pandas/core/sparse/frame.py 28.76% <60%> (+0.49%) ⬆️
pandas/io/gbq.py 25% <0%> (-48.69%) ⬇️
pandas/core/arrays/array_.py 15.55% <0%> (-22.23%) ⬇️
pandas/core/sorting.py 22.22% <0%> (-4.12%) ⬇️
pandas/io/formats/format.py 30.27% <0%> (-4.03%) ⬇️
pandas/core/indexes/timedeltas.py 44.38% <0%> (-3.07%) ⬇️
... and 81 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 612c244...836d19b. Read the comment docs.

@codecov
Copy link

codecov bot commented May 14, 2019

Codecov Report

Merging #26137 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26137      +/-   ##
==========================================
- Coverage   91.77%   91.76%   -0.01%     
==========================================
  Files         174      174              
  Lines       50639    50646       +7     
==========================================
+ Hits        46473    46476       +3     
- Misses       4166     4170       +4
Flag Coverage Δ
#multiple 90.3% <100%> (ø) ⬆️
#single 41.69% <45.45%> (-0.09%) ⬇️
Impacted Files Coverage Δ
pandas/core/series.py 93.61% <ø> (ø) ⬆️
pandas/core/frame.py 97% <ø> (-0.12%) ⬇️
pandas/core/generic.py 93.57% <ø> (ø) ⬆️
pandas/core/sparse/series.py 93.24% <100%> (+0.06%) ⬆️
pandas/core/arrays/sparse.py 93.07% <100%> (ø) ⬆️
pandas/core/sparse/frame.py 95.65% <100%> (+0.01%) ⬆️
pandas/core/sparse/scipy_sparse.py 100% <100%> (ø) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7f0a743...12d8d83. Read the comment docs.

@@ -6,27 +6,28 @@
Sparse data structures
**********************

We have implemented "sparse" versions of ``Series`` and ``DataFrame``. These are not sparse
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the diff here is all that informative. I'd recommend just viewing the new file. The basic flow is

  • short intro
  • SparseArray / SparseDtype
  • Sparse Accessors
  • SparseIndex / computation
  • Migration Guide
  • SparseSeries / SparseDataFrame.

@TomAugspurger
Copy link
Contributor Author

Note: we still have some warnings leaking through on some of the CI jobs (just not numpydev). Trying to track those down.

@TomAugspurger
Copy link
Contributor Author

I think I got all the warnings... I added a global filterwarnings to our setup.cfg https://github.com/pandas-dev/pandas/pull/26137/files#diff-380c6a8ebbbce17d55d50ef17d3cf906. This proved helpful in tracking them down. Are people OK with keeping that there? As an aside, I couldn't get the syntax for "raise on all warnings from pandas" to work. In theory error:::pandas[.*] should do it, but that was still elevating warnings from other packages.

@jreback
Copy link
Contributor

jreback commented May 15, 2019

will have a look soon

doc/source/user_guide/sparse.rst Outdated Show resolved Hide resolved
doc/source/user_guide/sparse.rst Outdated Show resolved Hide resolved
doc/source/user_guide/sparse.rst Outdated Show resolved Hide resolved

.. code-block:: python

# Old way
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use *Previous* and *New*

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change old to previous. I think I'll keep them as comments, rather than **-style headings, since we're using ** for the subtopic (e.g. construction).

doc/source/user_guide/sparse.rst Show resolved Hide resolved
doc/source/user_guide/sparse.rst Show resolved Hide resolved
doc/source/whatsnew/v0.25.0.rst Outdated Show resolved Hide resolved
pandas/core/generic.py Show resolved Hide resolved
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger thanks a lot for this!

Did a first pass, and some high level comments:

  • in some older whatsnew files, we will need add some :okwarnings: for now (see the doc build on travis)
  • In the migration section, I think we also need to state some differences between old SparseDataFrame/Series and the new way. Eg:
    • It is no longer guaranteed that all columns are sparse. You can have a mixture.
    • Practical consequence of the above: assigning values to a new column of a "sparse" dataframe no longer automatically sparsifies it, you need to do that yourself
    • also related: no more a default_fill_value (but if you can't assign values with automatic sparsification, this default fill value also has no use, I think, so this is not really a problem given the above)
  • might be for a different issue, but noted this while reviewing: when having mixed sparse and non-sparse columns in a dataframe, the sparse accessor should either give a better error message (indicating that not all columns are sparse) or either work (eg density could in principle work for a mixture)
    • related to that: how to convert to dense if you have a mixture?

@@ -35,21 +36,64 @@ large, mostly NA ``DataFrame``:

df = pd.DataFrame(np.random.randn(10000, 4))
df.iloc[:9998] = np.nan
sdf = df.to_sparse()
sdf = df.astype(pd.SparseDtype("float", np.nan))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For such a purpose, I was thinking we could also provide df.sparse.to_sparse() to convert a full DataFrame to sparse?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me, though perhaps as a followup? I don't plan to put more time into sparse personally.

Copy link
Contributor

@jreback jreback May 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would have to be a DataFrame.astype('sparse') ? though I think, IOW, or is .sparse allowed on any DataFrame? the semantics are a bit odd on this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger can you respond to this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, IOW, or is .sparse allowed on any DataFrame? the semantics are a bit odd on this

Kinda. If you just do df.sparse on a dataframe without all-sparse values, we raise

In [6]: pd.DataFrame({"A": [1, 2]}).sparse
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-6-ab0fb67ed650> in <module>
----> 1 pd.DataFrame({"A": [1, 2]}).sparse

...

~/sandbox/pandas/pandas/core/arrays/sparse.py in _validate(self, data)
   2119         dtypes = data.dtypes
   2120         if not all(isinstance(t, SparseDtype) for t in dtypes):
-> 2121             raise AttributeError(self._validation_msg)
   2122
   2123     @classmethod

AttributeError: Can only use the '.sparse' accessor with Sparse data.

But we also allow for pd.DataFrame.sparse.from_spmatrix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would have to be a DataFrame.astype('sparse')

It would be .astype('Sparse') which is shorthand for .astype(SparseDtype(float64, nan))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok this is all fine; is there a test for using .sparse on non-any-sparse df?

doc/source/user_guide/sparse.rst Outdated Show resolved Hide resolved
doc/source/user_guide/sparse.rst Show resolved Hide resolved
arr[2:5] = np.nan
arr[7:8] = np.nan
sparr = pd.SparseArray(arr)
sparr
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not important for this PR, but we should actually improve the repr of SparseArray. Currently the example gives

[-2.329703982704994, -0.7776235464173905, nan, nan, nan, -0.07270483900887693, 0.4093257484722553, nan, -0.33749585746785415, 1.884146289689117]
Fill: nan
IntIndex
Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)

(so way to wide, and showing too much detail of the random numbers)

doc/source/user_guide/sparse.rst Show resolved Hide resolved
doc/source/user_guide/sparse.rst Show resolved Hide resolved

A ``.sparse`` accessor has been added for :class:`DataFrame` as well.
See :ref:`api.dataframe.sparse` for more.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
See :ref:`api.dataframe.sparse` for more.
See :ref:`api.frame.sparse` for more.

pandas/core/generic.py Show resolved Hide resolved
pandas/core/sparse/frame.py Outdated Show resolved Hide resolved
pandas/core/sparse/series.py Show resolved Hide resolved
@TomAugspurger
Copy link
Contributor Author

I've updated the doc examples to all use Series[sparse], rather than SparseSeries. I've just left a note that the sparse subclasses are deprecated.

@TomAugspurger
Copy link
Contributor Author

c5fa3fb also has a change to Series.sparse.from_coo. Previously that was using SparseSeries internally, and so a warning was being raised. I (lazily) applied the warnings filter to the class so it was being ignored in the test.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. just a couple of points.

doc/source/user_guide/sparse.rst Outdated Show resolved Hide resolved
doc/source/user_guide/sparse.rst Show resolved Hide resolved
doc/source/user_guide/sparse.rst Show resolved Hide resolved
doc/source/user_guide/sparse.rst Outdated Show resolved Hide resolved
pandas/core/generic.py Show resolved Hide resolved
@@ -116,14 +116,19 @@ def _sparse_series_to_coo(ss, row_levels=(0, ), column_levels=(1, ),
return sparse_matrix, rows, columns


def _coo_to_sparse_series(A, dense_index=False):
def _coo_to_sparse_series(A, dense_index=False, sparse_series=True):
"""
Convert a scipy.sparse.coo_matrix to a SparseSeries.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a doc-string here (types too if you can!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I'm not really sure on two things

  1. The type for A is 'scipy.sparse.coo.coo_matrix', but we can't import sparse.
  2. The return type is Union[Series, SparseSeries] but importing SparseSeries would cause a circular import

so I left types off for those.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. can't you just use the string? (I think that works)
  2. same use the string

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you? Are these types actually checked in our CI? I'd rather not introduce invalid types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes they should be

s = Series(A.data, MultiIndex.from_arrays((A.row, A.col)))
s = s.sort_index()
s = s.to_sparse() # TODO: specify kind?
if sparse_series:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why exactly do you need sparse_series flag? why can't we just do the astype after calling this routine?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is called from both Series.sparse and SparseSeries.

Previously, this went coo_matrix -> SparseSeries -> Series[sparse], which caused an undesired warning for Series.sparse.from_coo(). Once SparseSeries is gone we can remove the keyword.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok can you add a todo about this then, this is not obvious at all

@jreback jreback added this to the 0.25.0 milestone May 19, 2019

You can apply NumPy *ufuncs* to ``SparseArray`` and get a ``SparseArray`` as a result.
Sparse-specific properties, like ``density``, are available on the ``.sparse`` accssor.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

accssor typo.

@@ -35,21 +36,64 @@ large, mostly NA ``DataFrame``:

df = pd.DataFrame(np.random.randn(10000, 4))
df.iloc[:9998] = np.nan
sdf = df.to_sparse()
sdf = df.astype(pd.SparseDtype("float", np.nan))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger can you respond to this

doc/source/user_guide/sparse.rst Show resolved Hide resolved
Sparse Calculation
------------------

You can apply NumPy `ufuncs <https://docs.scipy.org/doc/numpy/reference/ufuncs.html>`_
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason we are recommending people work directly with SparseArray? the unit of computation is generally the Series, no?

Copy link
Contributor Author

@TomAugspurger TomAugspurger May 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was here before, just moved. Whether or not it makes sense, I dunno. Depends on whether or not you need / want an index I suppose.

doc/source/user_guide/sparse.rst Outdated Show resolved Hide resolved
~~~~~~~~~~~~

A :meth:`SparseSeries.to_coo` method is implemented for transforming a ``SparseSeries`` indexed by a ``MultiIndex`` to a ``scipy.sparse.coo_matrix``.
:meth:`Series.sparse.to_coo` is implemented for transforming a ``Series`` with sparse values indexed by a ``MultiIndex`` to a ``scipy.sparse.coo_matrix``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:class:`MultiIndex`

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have the doc inventory for scipy? can you add a refernce to coo_matrix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do have SciPy in our intersphinx.

@@ -116,14 +116,19 @@ def _sparse_series_to_coo(ss, row_levels=(0, ), column_levels=(1, ),
return sparse_matrix, rows, columns


def _coo_to_sparse_series(A, dense_index=False):
def _coo_to_sparse_series(A, dense_index=False, sparse_series=True):
"""
Convert a scipy.sparse.coo_matrix to a SparseSeries.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. can't you just use the string? (I think that works)
  2. same use the string

s = Series(A.data, MultiIndex.from_arrays((A.row, A.col)))
s = s.sort_index()
s = s.to_sparse() # TODO: specify kind?
if sparse_series:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok can you add a todo about this then, this is not obvious at all

@@ -215,6 +215,7 @@ def test_scalar_with_index_infer_dtype(self, scalar, dtype):
assert exp.dtype == dtype

@pytest.mark.parametrize("fill", [1, np.nan, 0])
@pytest.mark.filterwarnings("ignore:Sparse:FutureWarning")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you don't need these as a prior PR added this to setup.cfg

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The setup.cfg has an error::: config to elevate unhandled warnings to errors. We still need these otherwise the tests would fail.

We have a single test asserting that SparseSeries.__init__ warns, and explicitly ignore the warnings elsewhere.

@jreback jreback merged commit e7ad884 into pandas-dev:master May 29, 2019
@jreback
Copy link
Contributor

jreback commented May 29, 2019

thanks @TomAugspurger

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are still a bunch of :okwarnings: needed in older whatsnew files.

~~~~~~~~~~~~~~~

.. versionadded:: 0.20.0
Use :meth:`DataFrame.sparse.from_coo` to create a ``DataFrame`` with sparse values from a sparse matrix.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be from_spmatrix ?

class SparseSeries(Series):
"""Data structure for labeled, sparse floating point data

.. deprectaed:: 0.25.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented May 29, 2019 via email

@TomAugspurger TomAugspurger deleted the depr-sparse-depr branch May 29, 2019 14:02


class SparseDataFrame(DataFrame):
"""
DataFrame containing sparse floating point data in the form of SparseSeries
objects

.. deprectaed:: 0.25.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DEPR: SparseDataFrame and SparseSeries subclasses
6 participants